It’s been a while since I’ve formally practiced any data science. Although it occupied the better part of my life from 2021 through 2024 while I was in grad school, I’m amazed how rusty a half-year break to travel has made me. To refresh my memory and help educate others, I’m exploring some synthetic data on student GPAs to demystify A/B testing (Student’s t-test, ANOVA, and their non-parametric counterparts) and to teach some linear-regression-based feature selection methods.
This student habit and academic performance data comes from Kaggle and features many useful predictor variables of different data types. I walk through my full analytic and EDA process in this R Markdown notebook, starting with some simple exploration below.
# preliminary EDA chunk; I inspect the data in the notebook after this to ideate some next questions
head(grades_and_features)
# further investigations post prelim-EDA
#1. Are the student IDs true primary keys? Will output true if there are ANY repeats
print(any(duplicated(grades_and_features$student_id)))
[1] FALSE
#2. Are the text categorical features separable as expected (no extra classes from typos, and no classes that should roll into each other)?
grades_and_features %>%
  select(where(is.character)) %>%
  iwalk(~ {
    cat("\nColumn:", .y, "\n")
    print(unique(.x))
  })
Column: gender
[1] "Male" "Other" "Female"
Column: major
[1] "Computer Science" "Arts" "Psychology" "Business"
[5] "Engineering" "Biology"
Column: part_time_job
[1] "Yes" "No"
Column: diet_quality
[1] "Poor" "Good" "Fair"
Column: parental_education_level
[1] "Some College" "PhD" "High School" "Master" "Bachelor"
Column: internet_quality
[1] "High" "Low" "Medium"
Column: extracurricular_participation
[1] "Yes" "No"
Column: dropout_risk
[1] "No" "Yes"
Column: study_environment
[1] "Co-Learning Group" "Library" "Quiet Room" "Dorm"
[5] "Cafe"
Column: access_to_tutoring
[1] "Yes" "No"
Column: family_income_range
[1] "High" "Low" "Medium"
Column: learning_style
[1] "Reading" "Kinesthetic" "Visual" "Auditory"
#3. Are there any outliers in the numerical features that could disproportionately affect future models? E.g. a clerical input error we would not want to keep, as opposed to an outlier that could still be valid.
summary(grades_and_features %>% select(where(is.double)))
student_id age study_hours_per_day social_media_hours
Min. :100000 Min. :16 Min. : 0.000 Min. :0.000
1st Qu.:120000 1st Qu.:19 1st Qu.: 2.800 1st Qu.:1.200
Median :140000 Median :22 Median : 4.126 Median :2.500
Mean :140000 Mean :22 Mean : 4.174 Mean :2.501
3rd Qu.:159999 3rd Qu.:25 3rd Qu.: 5.500 3rd Qu.:3.800
Max. :179999 Max. :28 Max. :12.000 Max. :5.000
netflix_hours attendance_percentage sleep_hours exercise_frequency
Min. :0.000 Min. : 40.00 Min. : 4.000 Min. :0.000
1st Qu.:1.000 1st Qu.: 55.00 1st Qu.: 6.000 1st Qu.:2.000
Median :2.000 Median : 69.90 Median : 7.000 Median :4.000
Mean :1.998 Mean : 69.97 Mean : 7.017 Mean :3.517
3rd Qu.:3.000 3rd Qu.: 84.90 3rd Qu.: 8.000 3rd Qu.:6.000
Max. :4.000 Max. :100.00 Max. :12.000 Max. :7.000
mental_health_rating previous_gpa semester stress_level
Min. : 1.000 Min. :1.640 Min. :1.000 Min. : 1.000
1st Qu.: 5.500 1st Qu.:3.270 1st Qu.:2.000 1st Qu.: 3.600
Median : 6.900 Median :3.790 Median :5.000 Median : 5.000
Mean : 6.804 Mean :3.602 Mean :4.497 Mean : 5.012
3rd Qu.: 8.200 3rd Qu.:4.000 3rd Qu.:7.000 3rd Qu.: 6.400
Max. :10.000 Max. :4.000 Max. :8.000 Max. :10.000
social_activity screen_time parental_support_level motivation_level
Min. :0.0 Min. : 0.300 Min. : 1.000 Min. : 1.000
1st Qu.:1.0 1st Qu.: 7.800 1st Qu.: 3.000 1st Qu.: 3.000
Median :2.0 Median : 9.700 Median : 5.000 Median : 5.000
Mean :2.5 Mean : 9.673 Mean : 5.479 Mean : 5.489
3rd Qu.:4.0 3rd Qu.:11.600 3rd Qu.: 8.000 3rd Qu.: 8.000
Max. :5.0 Max. :21.000 Max. :10.000 Max. :10.000
exam_anxiety_score time_management_score exam_score
Min. : 5.000 Min. : 1.000 Min. : 36.00
1st Qu.: 7.000 1st Qu.: 3.200 1st Qu.: 82.00
Median :10.000 Median : 5.500 Median : 93.00
Mean : 8.508 Mean : 5.499 Mean : 89.14
3rd Qu.:10.000 3rd Qu.: 7.800 3rd Qu.:100.00
Max. :10.000 Max. :10.000 Max. :100.00
At a glance, the data is pretty clean. That makes sense given that it is synthetic and was probably generated programmatically.
Let’s pose and check some hypotheses using A/B testing (the t-test).
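Before testing on the real data, here’s a minimal sketch of the t-test mechanics on simulated data (the group names and distribution parameters are made up purely for illustration):

```r
# two simulated groups of exam-like scores drawn from normal distributions
set.seed(42)
scores_a <- rnorm(100, mean = 70, sd = 10)  # hypothetical group A
scores_b <- rnorm(100, mean = 74, sd = 10)  # hypothetical group B, truly higher mean
# Welch two-sample t-test; a small p-value is evidence the group means differ
t.test(scores_a, scores_b)
```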
#1. women have demonstrably higher exam scores than men.
# Four assumptions underlying a t-test:
# A. The observed variable of comparison is normally distributed
# B. There is equal variance amongst our classes
# C. The observations are independent
# D. The observed variable is continuous
# checking normality
library(ggplot2)
ggplot(grades_and_features, aes(sample = exam_score)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ gender)
# the distributions are not normal, so we should turn to a non-parametric alternative. In this case we will use the Wilcoxon rank-sum test.
# Assumptions
# a. the observations are independent
# b. the observed variable is continuous
# c. Identical shape of distribution (need not be normally distributed)
# d. Observations should be randomly sampled
wilcox.test(exam_score ~ gender, data = grades_and_features %>% filter(gender != 'Other'), alternative = "greater")
Wilcoxon rank sum test with continuity correction
data: exam_score by gender
W = 355030365, p-value = 0.795
alternative hypothesis: true location shift is greater than 0
# now that I know there's no meaningful difference between the test scores of men and women, let's add our third gender option into the mix. Now I hypothesize there is no meaningful difference, i.e. we should fail to reject the null hypothesis
#2. Exam scores between gender categories are not statistically meaningfully different
kruskal.test(exam_score ~ gender, data = grades_and_features)
Kruskal-Wallis rank sum test
data: exam_score by gender
Kruskal-Wallis chi-squared = 1.0044, df = 2, p-value = 0.6052
I’ve shown myself that, statistically speaking, there is no meaningful difference in test scores between gender categories. I’m being pedantic for educational purposes - that was a given. Let’s do one more test, on something I might actually be convinced is true - whilst remembering this data is synthetic, and observed synthetic patterns might not match expected reality.
#3. Exam scores are observably different between parent education levels.
kruskal.test(exam_score ~ parental_education_level, data = grades_and_features)
Kruskal-Wallis rank sum test
data: exam_score by parental_education_level
Kruskal-Wallis chi-squared = 0.39266, df = 4, p-value = 0.9831
Alright, well my hypothesis was wrong. Note that I could repeat this for every single categorical variable to determine which (if any) have an effect on exam_score. That would be a crude and rudimentary method of feature selection. There are better methods, of course, and I’ll be walking through those - but first let’s work the problem from another angle: feature removal.
We can check for multi-collinearity amongst features, starting with a correlation matrix (the more formal tool here is the variance inflation factor, VIF).
cor_matrix <- cor(grades_and_features[sapply(grades_and_features, is.numeric)], use = "complete.obs")
library(corrplot)
corrplot 0.95 loaded
corrplot::corrplot(cor_matrix, method = "color", tl.cex = 0.7)
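Since I name-dropped VIF: the variance inflation factor quantifies how much each predictor is linearly explained by the rest (VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the others). A quick sketch using the car package (assuming it is installed), run over the numeric columns:

```r
# variance inflation factors; values above roughly 5-10 are commonly
# flagged as multi-collinear
library(car)
library(dplyr)
numeric_data <- grades_and_features %>% select(where(is.numeric))
vif_model <- lm(exam_score ~ ., data = numeric_data)
car::vif(vif_model)
```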
Keep in mind, this synthetic data might not match real-world expectations, so the interpretations might likewise be void of meaning.
# removing features based on some observed multi-collinearity
grades_and_features <- grades_and_features %>% select(-screen_time, -motivation_level, -student_id)
Next I want to begin talking about the regression-based feature selection methods, but that requires a basic understanding of linear regression, so let’s build that first.
lm(exam_score ~ ., data = grades_and_features) %>% summary()
Call:
lm(formula = exam_score ~ ., data = grades_and_features)
Residuals:
Min 1Q Median 3Q Max
-23.2372 -2.4576 0.7929 1.7227 20.6819
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.806e+00 2.356e-01 20.399 <2e-16 ***
age 2.020e-04 3.941e-03 0.051 0.9591
genderMale -2.324e-02 3.613e-02 -0.643 0.5201
genderOther -1.835e-02 3.616e-02 -0.507 0.6119
majorBiology -6.059e-02 5.109e-02 -1.186 0.2357
majorBusiness 1.035e-02 5.102e-02 0.203 0.8393
majorComputer Science -9.559e-03 5.144e-02 -0.186 0.8526
majorEngineering 2.412e-02 5.107e-02 0.472 0.6368
majorPsychology 3.851e-02 5.087e-02 0.757 0.4490
study_hours_per_day 7.342e-03 7.785e-03 0.943 0.3457
social_media_hours 2.106e-03 1.021e-02 0.206 0.8366
netflix_hours 1.082e-02 1.277e-02 0.847 0.3969
part_time_jobYes -2.757e-03 2.952e-02 -0.093 0.9256
attendance_percentage 6.567e-05 8.516e-04 0.077 0.9385
sleep_hours 6.284e-05 1.012e-02 0.006 0.9950
diet_qualityGood -5.718e-02 3.300e-02 -1.733 0.0832 .
diet_qualityPoor -4.582e-02 4.425e-02 -1.036 0.3004
exercise_frequency 2.553e-03 6.476e-03 0.394 0.6935
parental_education_levelHigh School 6.594e-02 4.661e-02 1.415 0.1572
parental_education_levelMaster -9.923e-03 4.675e-02 -0.212 0.8319
parental_education_levelPhD 1.215e-02 4.677e-02 0.260 0.7951
parental_education_levelSome College 4.842e-03 4.660e-02 0.104 0.9172
internet_qualityLow -6.905e-02 3.611e-02 -1.912 0.0558 .
internet_qualityMedium -3.098e-03 3.617e-02 -0.086 0.9317
mental_health_rating -3.372e-04 7.737e-03 -0.044 0.9652
extracurricular_participationYes -1.322e-03 2.952e-02 -0.045 0.9643
previous_gpa 2.337e+01 3.600e-02 649.111 <2e-16 ***
semester 5.062e-05 6.430e-03 0.008 0.9937
stress_level 6.340e-03 7.991e-03 0.793 0.4275
dropout_riskYes -1.164e-01 1.115e-01 -1.044 0.2963
social_activity -5.617e-03 8.660e-03 -0.649 0.5165
study_environmentCo-Learning Group 5.910e-03 4.674e-02 0.126 0.8994
study_environmentDorm -4.319e-02 4.679e-02 -0.923 0.3561
study_environmentLibrary -1.882e-02 4.677e-02 -0.402 0.6875
study_environmentQuiet Room -3.172e-02 4.683e-02 -0.677 0.4982
access_to_tutoringYes 3.539e-04 2.983e-02 0.012 0.9905
family_income_rangeLow 2.001e-02 3.614e-02 0.554 0.5799
family_income_rangeMedium 1.579e-02 3.617e-02 0.436 0.6625
parental_support_level -8.856e-04 5.137e-03 -0.172 0.8631
exam_anxiety_score 1.846e-02 8.602e-03 2.146 0.0319 *
learning_styleKinesthetic 1.192e-02 4.178e-02 0.285 0.7754
learning_styleReading 2.917e-02 4.185e-02 0.697 0.4858
learning_styleVisual -5.822e-02 4.182e-02 -1.392 0.1639
time_management_score -4.490e-03 5.670e-03 -0.792 0.4284
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.174 on 79956 degrees of freedom
Multiple R-squared: 0.8704, Adjusted R-squared: 0.8704
F-statistic: 1.249e+04 on 43 and 79956 DF, p-value: < 2.2e-16
I think a thorough explanation of linear regression is a little outside the scope of this blog post and requires some foundational knowledge of linear algebra, calculus, and stats/probability. It’s not in my interest to talk at length about these, but I’ll try to give a layman’s primer on linear regression.
Hopefully you remember the algebraic equation for a line:
\[ y = mx + b \]
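In statistical notation the same line is usually written with Greek-letter coefficients and an explicit error term:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

Here \(\beta_0\) plays the role of \(b\) (the intercept), \(\beta_1\) plays the role of \(m\) (the slope), and \(\varepsilon\) is the residual error the line cannot explain.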
For any two variables that we hypothesize have a linear relationship, we can assume that for a given value of x there will be a statistically reasonable predicted value of y. We can visualize this easily enough with a scatter plot. Below, the points are the actual (x, y) coordinates of these variables and the blue line is our regressor.
ggplot(data = grades_and_features, aes(x = previous_gpa, y = exam_score)) + geom_point(alpha = 0.2) + geom_smooth(method = 'lm') + labs(title = 'Exam Score as a function of Previous GPA', x = 'Previous GPA', y = 'Exam Score')
And here we have it: a line of “best fit” that summarizes the relationship between previous GPA and the current exam score. Trust that there are mathematical optimization procedures occurring here that guarantee this is the line of best fit - the line that best summarizes a linear relationship between these two variables. It’s not perfect, of course: the relationship between previous GPA and exam score is clearly not a perfectly linear one, or the scatter plot would be a perfect line rather than a mass of points that can be generally summarized by one.
This implies that the relationship between these two variables is not solely linear, and that other variables (combined with previous GPA) may be necessary to accurately predict exam score. And now we move on to consider multi-variable linear regression.
To briefly explain how multi-variable linear regression works: the same as regular linear regression. Linear regression as a technique generalizes to any number of dimensions; the line of best fit simply becomes a hyperplane. Because we can only meaningfully perceive three dimensions, here’s what multiple linear regression looks like using previous GPA and exam anxiety score to predict exam score.
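Formally, with \(p\) predictors the model just gains one coefficient per variable:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \]

In the two-predictor case below, \(x_1\) is previous GPA and \(x_2\) is exam anxiety score, so the fitted “line” is a plane.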
library(plotly)
# Fit the model
mlm_model <- lm(exam_score ~ previous_gpa + exam_anxiety_score, data = grades_and_features)
# Create a grid for the plane using the real variable names
x1_seq <- seq(min(grades_and_features$previous_gpa, na.rm = TRUE),
              max(grades_and_features$previous_gpa, na.rm = TRUE), length.out = 30)
x2_seq <- seq(min(grades_and_features$exam_anxiety_score, na.rm = TRUE),
              max(grades_and_features$exam_anxiety_score, na.rm = TRUE), length.out = 30)
grid <- expand.grid(previous_gpa = x1_seq, exam_anxiety_score = x2_seq)
grid$exam_score <- predict(mlm_model, newdata = grid)
# Create 3D plot
plot_ly() %>%
  add_markers(data = grades_and_features, x = ~previous_gpa, y = ~exam_anxiety_score, z = ~exam_score,
              marker = list(size = 2), name = "Observed Data") %>%
  add_surface(x = ~x1_seq, y = ~x2_seq,
              z = ~matrix(grid$exam_score, nrow = length(x1_seq), byrow = TRUE),
              showscale = FALSE, opacity = 0.5, name = "Regression Plane") %>%
  layout(scene = list(
    xaxis = list(title = "Previous GPA"),
    yaxis = list(title = "Exam Anxiety Score"),
    zaxis = list(title = "Exam Score")
  ))
And now we have a hyperplane of best fit that shows the relationship between previous GPA, exam anxiety score, and exam score. Still, the relationship is not perfectly linear - given previous GPA and exam anxiety score, we can’t predict exam score with 100% certainty.
Data science is often described as both an art and a science. The art lies in recognizing that we may never fully capture the relationship between all possible predictors and an outcome like exam score. We may lack key variables, or the underlying patterns may be nonlinear or complex. But the science lies in our ability to make informed, methodical progress—using statistical tools and the scientific method—to build models that are useful, even if imperfect.
So now, I’ll begin to go through some of the scientific methods that inform feature selection. We’ve already gone through one: A/B testing. Determining whether a response variable is significantly impacted by the values of some categorical predictor variable is the crux of the t-test, ANOVA, and their non-parametric counterparts. We could iterate through all our categorical variables, applying the right test to each one, but this is a little rudimentary and doesn’t account for the numerical variables.
Allow me to discuss a few other feature selection methods built upon linear regression. First, I’ll need to explain some of the output of linear regression. Let’s return to the two dimensional regression predicting exam score by way of previous GPA. This time I’ll apply some standard handling conventions in preparation to develop a finalized model.
# forming a training+validation / test split
# 80% goes to training and validation
library(caret)       # createDataPartition()
library(tidymodels)  # vfold_cv(), recipes, workflows, tuning
set.seed(512)
trainval_indices <- createDataPartition(grades_and_features$exam_score, p = 0.8, list = FALSE)
trainval_data <- grades_and_features[trainval_indices, ]
test_data <- grades_and_features[-trainval_indices, ]
# cross-validation object
cv_obj <- vfold_cv(trainval_data, v = 5)
Beginning with stepwise regression: we add or remove one variable at a time, starting from the full model with all variables, in search of a change that continues to reduce a metric known as the Akaike Information Criterion (AIC) - a number for comparing the efficiency of models whilst holding them accountable for the number of predictors they have.
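For reference, AIC is defined as

\[ \mathrm{AIC} = 2k - 2\ln(\hat{L}) \]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the model’s maximized likelihood; lower AIC is better, and the \(2k\) term is what penalizes models for carrying extra predictors.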
# scaling the test and training data
rec <- recipe(exam_score ~ ., data = trainval_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors())
prep_rec <- prep(rec, training = trainval_data)
trainval_processed <- bake(prep_rec, new_data = trainval_data)
test_processed <- bake(prep_rec, new_data = test_data)
From here we run our stepwise regression model.
# a full model to stepwise train from
full_model <- lm(exam_score ~ ., data = trainval_processed)
# the stepwise regression derived lm
stepwise_model <- stats::step(full_model, direction = 'both', trace = 0)
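A quick way to see which variables survived the search is to pull the final formula (or coefficient names) off the fitted model:

```r
# the predictors retained by stepwise selection
formula(stepwise_model)
names(coef(stepwise_model))
```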
The stepwise regression model is relatively easy to obtain (from a coding perspective); at a later point we can train the derived model on the full training data and evaluate on the test data. For now, we’ll move on to training a LASSO regression model.
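LASSO approaches selection differently: rather than searching over subsets of variables, it fits every coefficient at once while penalizing their absolute size:

\[ \min_{\beta_0, \beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]

The penalty \(\lambda\) shrinks uninformative coefficients exactly to zero, which is what turns LASSO into a feature selection method.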
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_recipe(rec)
# Choose grid of penalty values (can be customized)
lambda_grid <- tibble(penalty = c(1, 0.1, 0.01))
# grid_regular(penalty(range = c(-4, 0)), levels = 30)
# LASSO tuning
lasso_res <- tune_grid(
  lasso_wf,
  resamples = cv_obj,
  grid = lambda_grid,
  metrics = metric_set(rmse, rsq, mae)
)
best_lasso <- select_best(lasso_res, metric = "rmse")
lasso_final_wf <- finalize_workflow(lasso_wf, best_lasso)
lasso_model <- fit(lasso_final_wf, data = trainval_data)
lasso_model_fit <- extract_fit_parsnip(lasso_model)
lasso_model_fit$fit$call
glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian",
alpha = ~1)
# Extract glmnet model
glmnet_fit <- lasso_model_fit$fit
# Use best lambda
best_lambda <- best_lasso$penalty
# Get the coefficient matrix at that lambda (this is a sparse matrix)
coef_mat <- coef(glmnet_fit, s = best_lambda)
# Convert to tidy tibble manually
library(tibble)
library(dplyr)
selected_features <- as.matrix(coef_mat) %>%
  as_tibble(rownames = "term") %>%
  rename(estimate = s0) %>%
  filter(estimate != 0)
selected_features
The LASSO regression model determines that the only necessary variable in relation to exam_score is previous_gpa, whereas our stepwise regression model selected previous_gpa along with dummy-encoded levels of the internet_quality, study_environment, and learning_style variables. Both of these regression techniques have acted as variable selection methodologies, similar to how we could have applied t-tests or ANOVA across a large swath of our categorical variables to perform feature selection.
Now all that’s left is to ask ourselves: which model performs best?
# generating LASSO predictions
lasso_preds <- predict(lasso_model, new_data = test_data) %>%
  bind_cols(test_data %>% select(exam_score))
# generating stepwise predictions
test_for_stepwise <- bake(prep(rec, training = trainval_data), new_data = test_data)
stepwise_preds <- predict(stepwise_model, newdata = test_for_stepwise) %>%
  bind_cols(test_data %>% select(exam_score)) %>%
  rename('.pred' = '...1')
New names:
• `` -> `...1`
# calculating rmse of both
lasso_rmse <- sqrt(sum((lasso_preds$.pred - lasso_preds$exam_score)^2) / nrow(lasso_preds))
stepwise_rmse <- sqrt(sum((stepwise_preds$.pred - stepwise_preds$exam_score)^2) / nrow(stepwise_preds))
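For reference, the quantity computed above is

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \]

the typical prediction error on the held-out test set, in the same units as exam_score; whichever model posts the lower RMSE generalizes better here.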